by DL Keeshin
December 22, 2024
In my last post, I talked about managing organizational data. In this post, I want to discuss Retrieval-Augmented Generation (RAG). We'll dive into what RAG is, why it matters, and how we're incorporating it into the design of the kDS data source discovery (DSD) app.
RAG is an AI technique that blends information retrieval with generative models to produce responses that are both accurate and contextually relevant. Here's how it works in the app:
The app prompts an LLM to generate a plausible NAICS (North American Industry Classification System) code based on the user's description of their industry. While the LLM isn't searching an external database, it draws on its internal knowledge, a loose form of retrieval.
The app integrates the generated NAICS code into the user interface and allows the user to confirm or modify it. This feedback loop ensures the final code is accurate and tailored to the user's needs.
The LLM generates the NAICS code and may refine its suggestions based on user interactions.
The app includes a table of more than 1,000 NAICS codes to ensure accuracy. Users can retrieve multiple suggestions; if none are suitable, they're prompted to select a code from the table manually. This fallback mechanism combines the best of AI-driven insight and structured data retrieval.
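A minimal sketch of this generate-then-validate-then-fallback flow, in Python. The function names, the stubbed LLM call, and the `naics_codes` lookup table are assumptions for illustration, not the app's actual implementation:

```python
import sqlite3

# Tiny stand-in for the app's NAICS lookup table (the real one has 1,000+ rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE naics_codes (code TEXT PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO naics_codes VALUES (?, ?)",
    [("541511", "Custom Computer Programming Services"),
     ("541512", "Computer Systems Design Services"),
     ("722511", "Full-Service Restaurants")],
)

def llm_suggest_code(industry_description: str) -> str:
    """Stub for the LLM call: returns a plausible NAICS code for the description."""
    # A real implementation would prompt an LLM here; this stub is hard-coded.
    return "541511" if "software" in industry_description.lower() else "000000"

def resolve_code(industry_description: str) -> dict:
    """Generate a code, validate it against the table, else fall back to manual pick."""
    suggestion = llm_suggest_code(industry_description)
    row = conn.execute(
        "SELECT code, title FROM naics_codes WHERE code = ?", (suggestion,)
    ).fetchone()
    if row:
        return {"code": row[0], "title": row[1], "source": "llm"}
    # Fallback: hand the full table to the user to choose from.
    choices = conn.execute(
        "SELECT code, title FROM naics_codes ORDER BY code"
    ).fetchall()
    return {"code": None, "choices": choices, "source": "manual"}

print(resolve_code("We build custom software"))   # validated LLM suggestion
print(resolve_code("Deep-sea kelp farming"))      # falls back to manual selection
```

The key design point is that the LLM's output is never trusted blindly: it is checked against the structured table, and the user always has the final say.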
When dealing with large volumes of data, vector databases are often used to efficiently manage and retrieve information. These databases store data as high-dimensional vectors, enabling fast and accurate similarity searches. This approach is especially beneficial in RAG systems where embeddings of textual data are calculated and stored, facilitating precise matches during the retrieval step.
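At its core, the retrieval step behind a vector database is a nearest-neighbor search over embeddings. Here is a toy sketch in Python; the vectors and labels are made up, and a real system would compute embeddings with a model and use a proper vector index rather than a linear scan:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings": a real system would compute these with an embedding model
# and store them in a vector database.
index = {
    "541511 Custom Computer Programming Services": [0.9, 0.1, 0.2],
    "722511 Full-Service Restaurants":             [0.1, 0.8, 0.3],
}

def nearest(query_vec, index, k=1):
    """Return the labels of the k index entries most similar to the query vector."""
    ranked = sorted(index.items(),
                    key=lambda kv: cosine_similarity(query_vec, kv[1]),
                    reverse=True)
    return [label for label, _ in ranked[:k]]

print(nearest([0.85, 0.15, 0.25], index))  # -> ['541511 Custom Computer Programming Services']
```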
Fine-tuning an LLM can serve a similar end. By training the model on specific data, you effectively bake domain knowledge into the model itself, so it can generate responses tailored to a particular domain or task without an explicit retrieval step, approximating the retrieval-augmentation process through its fine-tuned knowledge.
kDS-DSD uses fine-tuning when analyzing and summarizing interview answers. The process has the LLM break down answers by topics such as data sources, flows, usage quality, and sentiment, a good topic for a future post.
To support RAG functionality, the data architecture must integrate retrieved and generated items seamlessly. For instance, the stage.parent table, which stores parent organization data, includes fields for the industry code and additional metadata:
CREATE TABLE IF NOT EXISTS stage.parent
(
-- Source organization data
name_ VARCHAR(96) NOT NULL,
organization_type VARCHAR(32),
stock_symbol VARCHAR(8),
product_service VARCHAR(128),
annual_revenue VARCHAR(48),
employee_total VARCHAR(48),
website_ VARCHAR(92),
location_ VARCHAR(48),
-- Audit fields
source_ VARCHAR(96) NOT NULL,
create_date date NOT NULL,
created_by VARCHAR(92) NOT NULL,
modified_date date,
modified_by VARCHAR(92),
-- RAG outputs: the generated industry code plus its provenance
rag_industry_code VARCHAR(24) NOT NULL,
rag_industry_description TEXT,
rag_description_rationale TEXT,
rag_industry_source VARCHAR(96),
rag_industry_create_date date,
rag_industry_created_by VARCHAR(96),
CONSTRAINT pk_stage_parent PRIMARY KEY (name_)
);
This design embeds RAG outputs alongside source data, ensuring traceability and allowing for both automated and manual validation of industry codes.
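To illustrate how a RAG result might land next to the source data, here is a hedged Python sketch using SQLite. The table is renamed stage_parent (SQLite has no schemas), the column list is trimmed to the essentials, and all of the inserted values, including the "llm:gpt-4o" provenance tag, are made up for the example:

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
# Trimmed-down version of stage.parent; SQLite has no schemas, so the
# table is named stage_parent here.
conn.execute("""
CREATE TABLE IF NOT EXISTS stage_parent (
    name_ TEXT PRIMARY KEY,
    source_ TEXT NOT NULL,
    create_date TEXT NOT NULL,
    created_by TEXT NOT NULL,
    rag_industry_code TEXT NOT NULL,
    rag_industry_description TEXT,
    rag_industry_source TEXT
)""")

# Store the user-confirmed, LLM-generated code next to the source data,
# recording where it came from so the suggestion stays traceable.
conn.execute(
    "INSERT INTO stage_parent VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("Acme Corp", "interview", date.today().isoformat(), "analyst",
     "541511", "Custom Computer Programming Services", "llm:gpt-4o"),
)

row = conn.execute(
    "SELECT name_, rag_industry_code, rag_industry_source FROM stage_parent"
).fetchone()
print(row)  # ('Acme Corp', '541511', 'llm:gpt-4o')
```

Because the rag_* provenance columns travel with the row, a later reviewer can always tell whether a code came from the model or from a manual table lookup.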
RAG represents a powerful fusion of retrieval and generative technologies, enabling smarter, more context-aware systems. By combining the strengths of large language models with structured data retrieval, it bridges the gap between AI creativity and factual accuracy. In the kDS-DSD app, we've integrated these principles into NAICS code generation and plan to extend them to more complex tasks like summarizing interview data by topics such as data sources, flows, and sentiment. As we refine this design, we aim to empower users with smarter tools and better insights.
Thanks for stopping by. This is the last post of 2024. We'll talk again in 2025. Have a great holiday and a wonderful new year! Peace.